Text Analysis Activity

Question 1

Here I import our data and create a document-term matrix.

library(tm)
## Warning: package 'tm' was built under R version 3.6.2
## Loading required package: NLP
library(SnowballC)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Loading Beatles Data
  beatles.lyrics<-read.csv("./lyrics_beatles.csv", header=TRUE, stringsAsFactors=FALSE)

#Looking at structure of data
  str(beatles.lyrics)
## 'data.frame':    187 obs. of  4 variables:
##  $ ï..songs_title   : chr  "I Saw Her Standing There" "Misery" "Anna (Go To Him)" "Chains" ...
##  $ songs_Writers    : chr  "Writer(s): JOHN LENNON, PAUL MCCARTNEY" "Writer(s): John Winston Lennon, Paul James Mccartney" "Writer(s): ARTHUR ALEXANDER" "Writer(s): Gerry Goffin, Carole King" ...
##  $ songs_Song_Lyrics: chr  "(1,2,3,4!)\nWell, she was just seventeen\nYou know what I mean\nAnd the way she looked was way beyond compare\n"| __truncated__ "The world is treating me bad... Misery\nI'm the kind of guy\nWho never used to cry\nThe world is treating me ba"| __truncated__ "Anna\nYou come and ask me, girl\nTo set you free, girl\nYou say he loves you more than me\nSo I will set you fr"| __truncated__ "Chains, my baby's got me locked up in chains\nAnd they ain't the kind that you can see\nWhoa, oh, these chains "| __truncated__ ...
##  $ Year             : int  1962 1962 1962 1962 1962 1962 1962 1962 1962 1962 ...
# Assigning Data to Corpus
      corpus<-Corpus(VectorSource(beatles.lyrics[,3]))
      corpus
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 187
#Converting everything to lower case and removing punctuation, stopwords, and numbers
      corpus <- tm_map(corpus, tolower)
## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops
## documents
      corpus[[1]]$content #Taking a look at first song to see how lower case looks
## [1] "(1,2,3,4!)\nwell, she was just seventeen\nyou know what i mean\nand the way she looked was way beyond compare\nso how could i dance with another (ooh)\nwhen i saw her standing there\nwell she looked at me, and i, i could see\nthat before too long i'd fall in love with her\nshe wouldn't dance with another (whooh)\nwhen i saw her standing there\nwell, my heart went \"boom\"\nwhen i crossed that room\nand i held her hand in mine\nwhoah, we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow i'll never dance with another (whooh)\nwhen i saw her standing there\nwell, my heart went \"boom\"\nwhen i crossed that room\nand i held her hand in mine\nwhoah, we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow i'll never dance with another (whooh)\nsince i saw her standing there\noh since i saw her standing there\noh since i saw her standing there"
      corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation
## drops documents
      corpus[[1]]$content #Taking a look at first song to see how it looks with no punctuation
## [1] "1234\nwell she was just seventeen\nyou know what i mean\nand the way she looked was way beyond compare\nso how could i dance with another ooh\nwhen i saw her standing there\nwell she looked at me and i i could see\nthat before too long id fall in love with her\nshe wouldnt dance with another whooh\nwhen i saw her standing there\nwell my heart went boom\nwhen i crossed that room\nand i held her hand in mine\nwhoah we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow ill never dance with another whooh\nwhen i saw her standing there\nwell my heart went boom\nwhen i crossed that room\nand i held her hand in mine\nwhoah we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow ill never dance with another whooh\nsince i saw her standing there\noh since i saw her standing there\noh since i saw her standing there"
      corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
       corpus[[1]]$content #Taking a look at first song to see how it looks with no numbers
## [1] "\nwell she was just seventeen\nyou know what i mean\nand the way she looked was way beyond compare\nso how could i dance with another ooh\nwhen i saw her standing there\nwell she looked at me and i i could see\nthat before too long id fall in love with her\nshe wouldnt dance with another whooh\nwhen i saw her standing there\nwell my heart went boom\nwhen i crossed that room\nand i held her hand in mine\nwhoah we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow ill never dance with another whooh\nwhen i saw her standing there\nwell my heart went boom\nwhen i crossed that room\nand i held her hand in mine\nwhoah we danced through the night\nand we held each other tight\nand before too long i fell in love with her\nnow ill never dance with another whooh\nsince i saw her standing there\noh since i saw her standing there\noh since i saw her standing there"
      corpus <- tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
      corpus[[1]]$content #Taking a look at first song to see how it looks with no stop words
## [1] "\nwell   just seventeen\n know   mean\n  way  looked  way beyond compare\n    dance  another ooh\n  saw  standing \nwell  looked       see\n   long id fall  love  \n wouldnt dance  another whooh\n  saw  standing \nwell  heart went boom\n  crossed  room\n  held  hand  mine\nwhoah  danced   night\n  held   tight\n   long  fell  love  \nnow ill never dance  another whooh\n  saw  standing \nwell  heart went boom\n  crossed  room\n  held  hand  mine\nwhoah  danced   night\n  held   tight\n   long  fell  love  \nnow ill never dance  another whooh\nsince  saw  standing \noh since  saw  standing \noh since  saw  standing "
      corpus <- tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
      corpus[[1]]$content #Taking a look to see if stemming worked
## [1] "well just seventeen know mean way look way beyond compar danc anoth ooh saw stand well look see long id fall love wouldnt danc anoth whooh saw stand well heart went boom cross room held hand mine whoah danc night held tight long fell love now ill never danc anoth whooh saw stand well heart went boom cross room held hand mine whoah danc night held tight long fell love now ill never danc anoth whooh sinc saw stand oh sinc saw stand oh sinc saw stand"
# Now storing corpus as a Doc-term matrix
      dtm <- DocumentTermMatrix(corpus)
      dtm
## <<DocumentTermMatrix (documents: 187, terms: 1719)>>
## Non-/sparse entries: 7000/314453
## Sparsity           : 98%
## Maximal term length: 17
## Weighting          : term frequency (tf)

Question 2

2.1.

By 187 documents, we mean that there are 187 Beatles songs for which we have lyrics in our data.

2.2.

By 1,719 terms, we mean that there are 1,719 unique terms (stemmed words) that appear across the lyrics of the 187 Beatles songs.

2.3.

By 7000/314453 non-/sparse entries, we mean that of the 321,453 document-term combinations in our document-term matrix, 7,000 combinations are nonzero and the remaining 314,453 are equal to zero.

2.4.

A sparsity of 98% implies that our DTM is very sparse, which is to say that the bulk of document-term combinations are zeros. After looking up how sparsity is calculated on Stack Overflow and DataCamp, it appears that 98% reflects that of the 321,453 possible document-term combinations, roughly 98% (314,453/321,453 ≈ 0.978) are equal to zero. Note that the denominator here (321,453) is the total number of entries, i.e. the 314,453 sparse entries reported in 2.3 plus the 7,000 non-sparse ones.
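The arithmetic behind these figures can be checked directly in R; this is a minimal sketch using only the counts tm reports above:

```r
# Figures reported by DocumentTermMatrix() above
docs      <- 187
terms     <- 1719
nonsparse <- 7000

total  <- docs * terms       # all possible document-term combinations: 321453
sparse <- total - nonsparse  # zero entries: 314453

sparsity <- sparse / total   # fraction of entries that are zero
round(sparsity, 2)           # 0.98, which tm prints as "Sparsity: 98%"
```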

Question 3

We can reduce the overall sparsity of our DTM using the removeSparseTerms() function. Its second argument tells R the maximum sparsity a term may have and still be kept: terms that are absent from a greater share of documents than this threshold are dropped. The lower this argument is, the more restrictive our criterion and the smaller the resulting DTM will be.
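The idea behind removeSparseTerms() can be sketched in base R on a toy matrix (the counts and threshold here are made up for illustration, and tm's exact boundary handling may differ):

```r
# Toy document-term matrix: 4 documents x 3 terms (hypothetical counts)
dtm.toy <- matrix(c(2, 0, 0, 0,   # "love": zero in 3 of 4 docs (sparsity 0.75)
                    1, 3, 0, 2,   # "know": zero in 1 of 4 docs (sparsity 0.25)
                    0, 0, 0, 1),  # "turn": zero in 3 of 4 docs (sparsity 0.75)
                  nrow = 4,
                  dimnames = list(NULL, c("love", "know", "turn")))

# A term's sparsity is the fraction of documents in which it never occurs
term.sparsity <- colMeans(dtm.toy == 0)

# Keep only terms whose sparsity is below the threshold; the rest are dropped
threshold   <- 0.5
dtm.reduced <- dtm.toy[, term.sparsity < threshold, drop = FALSE]
colnames(dtm.reduced)  # only "know" survives
```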

Question 4

#Here we remove terms with a sparsity above .99
  dtm<-removeSparseTerms(dtm,0.99)
      dtm  
## <<DocumentTermMatrix (documents: 187, terms: 776)>>
## Non-/sparse entries: 6057/139055
## Sparsity           : 96%
## Maximal term length: 16
## Weighting          : term frequency (tf)
#Here we remove terms with a sparsity above .98
  dtm<-removeSparseTerms(dtm,0.98)
      dtm  
## <<DocumentTermMatrix (documents: 187, terms: 412)>>
## Non-/sparse entries: 5225/71819
## Sparsity           : 93%
## Maximal term length: 10
## Weighting          : term frequency (tf)
#Here we remove terms with a sparsity above .97
  
  dtm<-removeSparseTerms(dtm,0.97)
      dtm  
## <<DocumentTermMatrix (documents: 187, terms: 274)>>
## Non-/sparse entries: 4621/46617
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency (tf)
#Here we remove terms with a sparsity above .96
  dtm<-removeSparseTerms(dtm,0.96)
      dtm 
## <<DocumentTermMatrix (documents: 187, terms: 219)>>
## Non-/sparse entries: 4267/36686
## Sparsity           : 90%
## Maximal term length: 10
## Weighting          : term frequency (tf)

Here we see that as we lower the sparsity threshold, the overall sparsity of our DTM decreases: a stricter threshold removes more and more unique terms from the data. The total number of entries in the DTM shrinks correspondingly, since each removed term deletes an entire column. (Note that each call above is applied to the already-reduced dtm; because a term's sparsity does not depend on the other terms, this gives the same result as applying each threshold to the original matrix.)

Question 5

Below we create a new document-term matrix with a maximum sparsity of 0.90.

  dtm.beatles<-removeSparseTerms(dtm,0.90)
      dtm.beatles
## <<DocumentTermMatrix (documents: 187, terms: 72)>>
## Non-/sparse entries: 2515/10949
## Sparsity           : 81%
## Maximal term length: 7
## Weighting          : term frequency (tf)
    #Here we convert the DTM into dataframe
      beatles.lyrics<-as.data.frame(as.matrix(dtm.beatles))
      head(beatles.lyrics)
##   heart ill just know long look love never night now ooh see way well
## 1     2   2    1    1    3    2    3     2     2   2   1   1   2    4
## 2     0   2    0    0    0    0    0     1     0   1   3   4   0    0
## 3     2   0    2    1    0    0    7     0     0   4   0   0   0    0
## 4     0   0    0    0    0    0    6     0     0   0   0   3   0    1
## 5     0   0    0    3    0    0    0     0     0   5   1   0   0    7
## 6     0   3    0    4    0    0    6     6     0   1   0   0   0    0
##   alway back can cant cri ive one thing will world come girl leav let like
## 1     0    0   0    0   0   0   0     0    0     0    0    0    0   0    0
## 2     1    2   2    1   1   1   4     2    2     2    0    0    0   0    0
## 3     0    2   1    0   0   4   2     2    3     0    1   10    2   1    2
## 4     0    0   3    3   0   0   0     0    0     0    0    0    0   1    2
## 5     0    0   0    0   0   0   0     0    0     1    0    3    0   0    0
## 6     3    0   0    4   2   1   0     2    0     0    0    0    0   0    0
##   say tell want around away babi got pleas think yeah your dont get take
## 1   0    0    0      0    0    0   0     0     0    0    0    0   0    0
## 2   0    0    0      0    0    0   0     0     0    0    0    0   0    0
## 3   1    1    1      0    0    0   0     0     0    0    0    0   0    0
## 4   0    2    0      1    2    4   6     1     1    3    1    0   0    0
## 5   5    0    0      1    0    0   0     0     0    4    0    3   2    1
## 6   3    2    2      0    0    0   0     0     3    0    2    0   0    0
##   blue make show time need said there tri word day home hear feel though
## 1    0    0    0    0    0    0     0   0    0   0    0    0    0      0
## 2    0    0    0    0    0    0     0   0    0   0    0    0    0      0
## 3    0    0    0    0    0    0     0   0    0   0    0    0    0      0
## 4    0    0    0    0    0    0     0   0    0   0    0    0    0      0
## 5    0    0    0    0    0    0     0   0    0   0    0    0    0      0
## 6    2    1    2    1    0    0     0   0    0   0    0    0    0      0
##   yes head mind eye good that right shes sing wait man youv everyth find
## 1   0    0    0   0    0    0     0    0    0    0   0    0       0    0
## 2   0    0    0   0    0    0     0    0    0    0   0    0       0    0
## 3   0    0    0   0    0    0     0    0    0    0   0    0       0    0
## 4   0    0    0   0    0    0     0    0    0    0   0    0       0    0
## 5   0    0    0   0    0    0     0    0    0    0   0    0       0    0
## 6   0    0    0   0    0    0     0    0    0    0   0    0       0    0
##   turn
## 1    0
## 2    0
## 3    0
## 4    0
## 5    0
## 6    0

We see that with our sparsity threshold set at 0.90, we are left with only 72 terms in our DTM.

Question 6

Creating frequency plot

  freq.dtm <- sort(colSums(beatles.lyrics),decreasing=TRUE)
      freq.data <- data.frame(word = names(freq.dtm),freq=freq.dtm)
  #For the purposes of display, I'm only going to include words with more than 150 or less than 50 appearances.
      freq.data.filt <- filter(freq.data, freq>150 | freq<50)
    
      library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
      freq.plot <- ggplot(freq.data.filt, aes(reorder(word, freq), freq)) + geom_col() +
        xlab(NULL) + coord_flip() + ylab("Frequency")+
        theme(text = element_text(size = 15))
      freq.plot    

6.1.

The plot suggests that the three most common words in these Beatles songs (at least among those left in my data) are “love”, “know”, and “dont” (the stemmed form of “don’t”).

6.2.

The plot suggests that the three least common words in these Beatles songs (at least among those left in my data) are “turn”, “show”, and “hear”.
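The ranking in 6.1 and 6.2 comes from the colSums()/sort() step in Question 6; the same logic can be sketched on a toy set of counts (hypothetical numbers, not the real lyrics):

```r
# Toy term counts in the same shape as beatles.lyrics (documents x terms)
counts <- data.frame(love = c(3, 0, 7),
                     know = c(1, 0, 1),
                     turn = c(0, 0, 1))

# Total frequency of each term across all documents, highest first
freq <- sort(colSums(counts), decreasing = TRUE)

names(freq)[1]             # most common term: "love"
names(freq)[length(freq)]  # least common term: "turn"
```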

Question 7

Here we will create a heatmap plotting correlations between words in our songs.

      library(qgraph)
## Warning: package 'qgraph' was built under R version 3.6.2
## Registered S3 methods overwritten by 'huge':
##   method    from   
##   plot.sim  BDgraph
##   print.sim BDgraph
      library(plotly)
## Warning: package 'plotly' was built under R version 3.6.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
      library(dplyr)
      cor.terms <- cor_auto(beatles.lyrics)
## Variables detected as ordinal: heart; ooh; way; alway; thing; world; leav; want; around; away; take; blue; show; said; there; tri; word; hear; though; yes; head; mind; eye; that; right; sing; wait; youv; everyth; find; turn
      a <- list(showticklabels = TRUE, tickangle = -45)
      beatles.cor.html <- plot_ly(x = colnames(cor.terms), y = colnames(cor.terms),
                          z = cor.terms, type = "heatmap") %>%
        layout(xaxis = a, showlegend = FALSE, margin = list(l=100, b=100, r=100, t=100))
      
      beatles.cor.html
      #Saving
      htmlwidgets::saveWidget(as_widget(beatles.cor.html), "./beatles.col.html")

7.1

Based on the heatmap alone, it appears that the terms “love” and “need” are the most strongly positively correlated (r=0.86).
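Rather than reading the strongest pair off the heatmap by eye, the largest off-diagonal correlation can also be pulled out programmatically. Here is a sketch with base R's cor() on toy counts (hypothetical numbers; cor_auto() above treats some variables as ordinal, so its values can differ from plain Pearson correlations):

```r
# Toy term counts (documents x terms; hypothetical numbers)
counts <- data.frame(love = c(3, 0, 6, 1),
                     need = c(2, 0, 5, 1),
                     turn = c(0, 1, 0, 0))

cm <- cor(counts)

# Mask the diagonal (self-correlations of 1), then find the strongest pair
diag(cm) <- NA
best <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
c(rownames(cm)[best["row"]], colnames(cm)[best["col"]])  # most correlated pair
```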